The notes you are reading now are the notes I took during a Datacamp course on text mining.
The course is quite nice and well structured, with a lot of practical examples. But it moves quite fast, so I tend to forget the exact syntax or the function names used in the examples. So I thought I would put them together in a series of examples that one can read, try, and use as a reference.
As a dataset we will use tweets. Since I could not find the original csv files from the course, I decided to download some real tweets myself. This could prove an interesting project and could give some interesting insights if we download the right tweets. To do this I followed the instructions on this website.
Let’s load the necessary libraries
library("devtools")
library("twitteR")
library("ROAuth")
Now we need to save our keys
secrets <- read.csv("/Users/umberto/Documents/Passwords and Secrets/twitter-keys.csv", stringsAsFactors = FALSE, header = TRUE, sep =",")
api_key <- secrets$api_key
api_secret <- secrets$api_secret
access_token <- secrets$access_token
access_token_secret <- secrets$access_token_secret
setup_twitter_oauth(api_key,api_secret)
## [1] "Using browser based authentication"
search.string <- "#coffee"
no.of.tweets <- 1000
c_tweets <- searchTwitter(search.string, n=no.of.tweets, lang="en")
Now we need to access the text of the tweets. We do it this way (we also need to clean the tweets of special characters that we don’t need for now, like emoticons, with the sapply function).
coffee_tweets = sapply(c_tweets, function(t) t$getText())
coffee_tweets <- sapply(coffee_tweets,function(row) iconv(row, "latin1", "ASCII", sub=""))
head(coffee_tweets)
## #Authors\n#Read \n#Write\nDrink #coffee :)\nCreate\nInspire\nDream Big\nNever give up & just do it! https://t.co/8iRHl9Czxl
## "#Authors\n#Read \n#Write\nDrink #coffee :)\nCreate\nInspire\nDream Big\nNever give up & just do it! https://t.co/8iRHl9Czxl"
## My #coffee and the #rhythm. \xed\xa0\xbc\xed\xbe\xa7\u2615 #lovemyjob #lovemylife https://t.co/cOMZCPYaOM
## "My #coffee and the #rhythm. #lovemyjob #lovemylife https://t.co/cOMZCPYaOM"
## Tuesday Tunes: Dan Fogleberg https://t.co/h8rG74RKVj #coffee #TuesdayTunes #Fogleberg
## "Tuesday Tunes: Dan Fogleberg https://t.co/h8rG74RKVj #coffee #TuesdayTunes #Fogleberg"
## Phoenix #coffee shop #blog @LivingPretty1 Curry Rivel by @Jennykaneauthor who was very impressed ✍\u2615\xed\xa0\xbd\xed\xb8\x87\xed\xa0\xbe\xed\xb4\x98 #phoenixvibe… https://t.co/hFHN8p8tbG
## "Phoenix #coffee shop #blog @LivingPretty1 Curry Rivel by @Jennykaneauthor who was very impressed #phoenixvibe https://t.co/hFHN8p8tbG"
## Phoenix #coffee shop #blog @LivingPretty1 Curry Rivel by @Jennykaneauthor who was very impressed ✍\u2615\xed\xa0\xbd\xed\xb8\x87\xed\xa0\xbe\xed\xb4\x98 #phoenixvibe… https://t.co/mFPmq3sTgN
## "Phoenix #coffee shop #blog @LivingPretty1 Curry Rivel by @Jennykaneauthor who was very impressed #phoenixvibe https://t.co/mFPmq3sTgN"
## Phoenix #coffee shop #blog @LivingPretty1 Curry Rivel by @Jennykaneauthor who was very impressed ✍\u2615\xed\xa0\xbd\xed\xb8\x87\xed\xa0\xbe\xed\xb4\x98 #phoenixvibe… https://t.co/LcAmh8zBvH
## "Phoenix #coffee shop #blog @LivingPretty1 Curry Rivel by @Jennykaneauthor who was very impressed #phoenixvibe https://t.co/LcAmh8zBvH"
It is interesting to see how many fields we get from the search
str(c_tweets[[1]])
## Reference class 'status' [package "twitteR"] with 17 fields
## $ text : chr "#Authors\n#Read \n#Write\nDrink #coffee :)\nCreate\nInspire\nDream Big\nNever give up & just do it! https:/"| __truncated__
## $ favorited : logi FALSE
## $ favoriteCount: num 0
## $ replyToSN : chr(0)
## $ created : POSIXct[1:1], format: "2017-05-23 20:15:17"
## $ truncated : logi FALSE
## $ replyToSID : chr(0)
## $ id : chr "867111712440238080"
## $ replyToUID : chr(0)
## $ statusSource : chr "<a href=\"http://www.hootsuite.com\" rel=\"nofollow\">Hootsuite</a>"
## $ screenName : chr "AlohaIsleCoffee"
## $ retweetCount : num 0
## $ isRetweet : logi FALSE
## $ retweeted : logi FALSE
## $ longitude : chr(0)
## $ latitude : chr(0)
## $ urls :'data.frame': 0 obs. of 4 variables:
## ..$ url : chr(0)
## ..$ expanded_url: chr(0)
## ..$ dispaly_url : chr(0)
## ..$ indices : num(0)
## and 53 methods, of which 39 are possibly relevant:
## getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet,
## getLatitude, getLongitude, getReplyToSID, getReplyToSN, getReplyToUID,
## getRetweetCount, getRetweeted, getRetweeters, getRetweets,
## getScreenName, getStatusSource, getText, getTruncated, getUrls,
## initialize, setCreated, setFavoriteCount, setFavorited, setId,
## setIsRetweet, setLatitude, setLongitude, setReplyToSID, setReplyToSN,
## setReplyToUID, setRetweetCount, setRetweeted, setScreenName,
## setStatusSource, setText, setTruncated, setUrls, toDataFrame,
## toDataFrame#twitterObj
So there are quite a few possibilities here. But we are not actually interested in the full status objects now, just in the text of the tweets (check for example this stackoverflow post as a reference).
Since we are going to compare corpora of text, we need a second set of tweets, and, following the example of the course, I decided to download the first 1000 tweets on tea.
search.string <- "#tea"
no.of.tweets <- 1000
t_tweets <- searchTwitter(search.string, n=no.of.tweets, lang="en")
Again we need to access the text of the tweets and clean it of special characters (like emoticons) with the sapply function.
tea_tweets = sapply(t_tweets, function(t) t$getText())
tea_tweets <- sapply(tea_tweets,function(row) iconv(row, "latin1", "ASCII", sub=""))
head(tea_tweets)
## #assam #tea and @mcvities richtea - what a perfect partnership!!! #successhour https://t.co/dmDF2GcSc1
## "#assam #tea and @mcvities richtea - what a perfect partnership!!! #successhour https://t.co/dmDF2GcSc1"
## RT @JessicaLSimps: #Tea time!! Love my tea \xed\xa0\xbd\xed\xb2\x9a https://t.co/tB1qzW5yee
## "RT @JessicaLSimps: #Tea time!! Love my tea https://t.co/tB1qzW5yee"
## RT @AsilaAR: Drinking a Full Bottle of Lipton White Raspberry Tea! https://t.co/ySWP7KzhJx via @YouTube\n@Lipton #lipton #tea #health #busin…
## "RT @AsilaAR: Drinking a Full Bottle of Lipton White Raspberry Tea! https://t.co/ySWP7KzhJx via @YouTube\n@Lipton #lipton #tea #health #busin"
## #Win #international #WithLoveforBooks #Kindle Fire, Amazon #giftcard, #tea, #chocolate, owl mugs & sweater #giveaway https://t.co/ZoznJm1RdA
## "#Win #international #WithLoveforBooks #Kindle Fire, Amazon #giftcard, #tea, #chocolate, owl mugs & sweater #giveaway https://t.co/ZoznJm1RdA"
## RT @RushAntiques: Beautiful Mother of Pearl decorated Tea Caddy, remnants of original tin lining! #teacaddy #teatime #tea https://t.co/xWyR…
## "RT @RushAntiques: Beautiful Mother of Pearl decorated Tea Caddy, remnants of original tin lining! #teacaddy #teatime #tea https://t.co/xWyR"
## What #beverage get's you through the work day? \n\n#coffee #tea #water #juice #kombucha
## "What #beverage get's you through the work day? \n\n#coffee #tea #water #juice #kombucha"
One of the most used libraries for text mining (and the one I will use here) is tm.
library("tm")
First we need to create a vector source from the texts
coffee_source <- VectorSource(coffee_tweets)
tea_source <- VectorSource(tea_tweets)
Then we need to make a VCorpus of the list of tweets
coffee_corpus <- VCorpus(coffee_source)
tea_corpus <- VCorpus(tea_source)
coffee_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1000
So if we want to see the text of a tweet in the corpus we can use
coffee_corpus[[15]][1]
## $content
## [1] "What #beverage get's you through the work day? \n\n#coffee #tea #water #juice #kombucha"
tea_corpus[[15]][1]
## $content
## [1] "SUMMER IS HERE!\n\nhttps://t.co/JFsHIYWgUm\n\n#VanyaCrafts #etsy #summervibes #sunshine #funinthesun #summer #tea https://t.co/7kbMrLwx00"
Now that I know how to make a corpus, I can focus on cleaning, or preprocessing, the text. In bag of words text mining, cleaning helps aggregate terms. For example, it may make sense that the words “miner”, “mining” and “mine” should be considered one term. Specific preprocessing steps will vary based on the project. For example, the words used in tweets are vastly different than those used in legal documents, so the cleaning process can also be quite different. (Text Source: Datacamp)
Common preprocessing functions include tolower(), removePunctuation(), removeNumbers() and stripWhitespace(). Note that tolower() is part of base R, while the other three functions come from the tm package.
The qdap package offers other text cleaning functions. Each is useful in its own way and is particularly powerful when combined with the others.
Using the c() function allows you to add new words (separated by commas) to the stop words list. For example, the following would add “word1” and “word2” to the default list of English stop words:
all_stops <- c("word1", "word2", stopwords("en"))
You can use the following command to remove stopwords
removeWords(text, stopwords("en"))
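Putting the two together, here is a minimal sketch (the sentence and the extra stop words "mug" and "cup" are made up for illustration):

```r
library(tm)

# Extend the default English stop word list with two custom words
all_stops <- c("mug", "cup", stopwords("en"))

# removeWords() drops every listed word from the text
removeWords("a big mug and a cup of coffee in the morning", all_stops)
```

Note that removeWords() leaves extra spaces behind, which is why stripWhitespace() usually follows it in a cleaning pipeline.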
Here is an example of stemming
stemDocument(c("computational", "computers", "computation"))
## [1] "comput" "comput" "comput"
Here is an example of using stemming together with stem completion
# Create complicate
complicate <- c("complicated", "complication", "complicatedly")
# Perform word stemming: stem_doc
stem_doc <- stemDocument(complicate)
# Create the completion dictionary: comp_dict
comp_dict <- "complicate"
# Perform stem completion: complete_text
complete_text <- stemCompletion(stem_doc, comp_dict)
# Print complete_text
complete_text
## complic complic complic
## "complicate" "complicate" "complicate"
To clean the Corpus we can define a function that applies several functions on the corpus
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "mug", "coffee"))
  return(corpus)
}
Then we can try to apply it on our corpus
clean_corp <- clean_corpus(coffee_corpus)
Then we can print a cleaned-up tweet
clean_corp[[227]][1]
## $content
## [1] "stainless steel 550ml 1354 httpstcogye0unwtae via kitchenhandle httpstcotptjmwxkfd"
and the original one
coffee_corpus[[227]][1]
## $content
## [1] "Stainless Steel #Coffee Mug - 550ml, only $13.54 https://t.co/Gye0UNWTaE via @kitchenhandle https://t.co/tpTJmwXKfD"
So we have removed special characters, punctuation and so on. Not all the words make much sense really (for example twitter usernames) but it should not be a problem since we don’t expect to see them very often in our corpus.
We can use the following code to make a DTM. Each document is represented as a row and each word as a column.
coffee_dtm <- DocumentTermMatrix(clean_corp)
# Print out coffee_dtm data
print(coffee_dtm)
## <<DocumentTermMatrix (documents: 1000, terms: 4026)>>
## Non-/sparse entries: 9146/4016854
## Sparsity : 100%
## Maximal term length: 63
## Weighting : term frequency (tf)
# Convert coffee_dtm to a matrix: coffee_m
coffee_m <- as.matrix(coffee_dtm)
# Print the dimensions of coffee_m
dim(coffee_m)
## [1] 1000 4026
# Review a portion of the matrix
coffee_m[148:150, 2587: 2590]
## Terms
## Docs mandobarista mandobaristadaily mango mannat
## 148 0 0 0 0
## 149 0 0 0 0
## 150 0 0 0 0
You can also build the transpose, a TDM, which has each word as a row and each document as a column.
# Create a TDM from clean_corp: coffee_tdm
coffee_tdm <- TermDocumentMatrix(clean_corp)
# Print coffee_tdm data
print(coffee_tdm)
## <<TermDocumentMatrix (terms: 4026, documents: 1000)>>
## Non-/sparse entries: 9146/4016854
## Sparsity : 100%
## Maximal term length: 63
## Weighting : term frequency (tf)
# Convert coffee_tdm to a matrix: coffee_m
coffee_m <- as.matrix(coffee_tdm)
# Print the dimensions of the matrix
dim(coffee_m)
## [1] 4026 1000
# Review a portion of the matrix
coffee_m[2587:2590, 148:150]
## Docs
## Terms 148 149 150
## mandobarista 0 0 0
## mandobaristadaily 0 0 0
## mango 0 0 0
## mannat 0 0 0
(source Datacamp) Now that you know how to make a term-document matrix, as well as its transpose, the document-term matrix, we will use it as the basis for some analysis. In order to analyze it we need to change it to a simple matrix like we did in chapter 1 using as.matrix.
Calling rowSums() on your newly made matrix aggregates all the terms used in a passage. Once you have the rowSums(), you can sort() them with decreasing = TRUE, so you can focus on the most common terms.
Lastly, you can make a barplot() of the top 5 terms of term_frequency with the following code.
barplot(term_frequency[1:5], col = "#C0DE25")
Of course, you could take our ggplot2 course to learn how to customize the plot even more… :)
So let’s try with our coffee tweets
## coffee_tdm is still loaded in your workspace
# Create a matrix: coffee_m
coffee_m <- as.matrix(coffee_tdm)
# Calculate the rowSums: term_frequency
term_frequency <- rowSums(coffee_m)
# Sort term_frequency in descending order
term_frequency <- sort(term_frequency, decreasing = TRUE)
# View the top 10 most common words
term_frequency[1:10]
## cup love amp day morning can via one drink
## 86 62 59 55 48 47 46 44 41
## great
## 40
# Plot a barchart of the 10 most common words
barplot(term_frequency[1:10], col = "tan", las = 2)
Now let’s make it a bit more pretty with ggplot2…
library(ggplot2)
library(dplyr)
tf <- as.data.frame(term_frequency)
tf$words <- row.names(tf)
tf10 <- as.data.frame(tf[1:10,])
# We need to make the words factors (ordered) otherwise ggplot2 will order the
# x axis alphabetically
tf10 <- mutate(tf10, words = factor(words, words))
ggplot(tf10, aes(x = words, y = term_frequency)) +
  geom_bar(stat = "identity", fill = "tan", col = "black") +
  theme_grey() +
  theme(text = element_text(size = 16),
        axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  ylab("Words Frequency")
Note that the following commands don’t work from RStudio if you want to use knitr, so the solution is to knit the document from the R console. The command will render an html file in the directory where the Rmd file is.
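A minimal sketch of the console rendering step (assuming the rmarkdown package is installed; the file name here is a placeholder):

```r
library(rmarkdown)

# Knit the Rmd file from a plain R console instead of the RStudio "Knit" button;
# the html output lands in the same directory as the Rmd file
render("text-mining-notes.Rmd", output_format = "html_document")
```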
library(rJava)
library(qdap)
Let’s build a word frequency plot with the qdap library
frequency <- freq_terms(coffee_tweets, top = 10, at.least = 3, stopwords = "Top200Words")
frequency <- mutate(frequency, WORD = factor(WORD, WORD))
ggplot(frequency, aes(x = WORD, y = FREQ)) +
  geom_bar(stat = "identity", fill = "tan", col = "black") +
  theme_grey() +
  theme(text = element_text(size = 16),
        axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  ylab("Words Frequency")
Now let’s remove more stopwords
frequency2 <- freq_terms(coffee_tweets, top = 10, at.least = 3, stopwords = tm::stopwords("english"))
frequency2 <- mutate(frequency2, WORD = factor(WORD, WORD))
ggplot(frequency2, aes(x = WORD, y = FREQ)) +
  geom_bar(stat = "identity", fill = "tan", col = "black") +
  theme_grey() +
  theme(text = element_text(size = 16),
        axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  ylab("Words Frequency")
library(wordcloud)
term_frequency[1:10]
## cup love amp day morning can via one drink
## 86 62 59 55 48 47 46 44 41
## great
## 40
word_freqs <- data.frame(term = names(term_frequency), num = term_frequency)
wordcloud(word_freqs$term, word_freqs$num, max.words = 100, colors = "red")
## Warning in wordcloud(word_freqs$term, word_freqs$num, max.words = 100,
## colors = "red"): tuesdaythoughts could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(word_freqs$term, word_freqs$num, max.words = 100,
## colors = "red"): coffeelover could not be fit on page. It will not be
## plotted.
Now we need to remove some words that clearly appear whenever people talk about coffee
# Add new stop words to clean_corpus()
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords,
                   c(stopwords("en"), "brew", "cafe", "coffeetime", "cup", "coffee"))
  return(corpus)
}
clean_coffee <- clean_corpus(coffee_corpus)
coffee_tdm <- TermDocumentMatrix(clean_coffee)
coffee_m <- as.matrix(coffee_tdm)
coffee_words <- rowSums(coffee_m)
Now we prepare the right order of words for the wordcloud
coffee_words <- sort(coffee_words, decreasing = TRUE)
coffee_words[1:6]
## love amp day morning can via
## 62 59 55 48 47 46
coffee_freqs <- data.frame (term = names(coffee_words), num = coffee_words)
wordcloud(coffee_freqs$term, coffee_freqs$num, max.words = 50, colors = "red")
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words = 50, :
## amp could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words = 50, :
## cbpcindy could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words = 50, :
## problem could not be fit on page. It will not be plotted.
wordcloud(coffee_freqs$term, coffee_freqs$num, max.words = 100, colors = c("grey80", "darkgoldenrod1", "tomato"))
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : coffeeaddict could not be fit on page. It will not be plotted.
(followed by similar “could not be fit on page” warnings for about twenty other terms, among them “love”, “great”, “day”, “good”, “shop” and “coffeelover”)
RColorBrewer color schemes are organized into three categories: sequential, diverging and qualitative.
To change the colors parameter of the wordcloud() function you can select a palette from RColorBrewer such as “Greens”. The function display.brewer.all() will list all predefined color palettes. More information on ColorBrewer (the framework behind RColorBrewer) is available on its website.
(Source: datacamp)
The function brewer.pal() allows you to select colors from a palette. Specify the number of distinct colors needed (e.g. 8) and the predefined palette to select from (e.g. “Greens”). Often in word clouds, very faint colors are washed out so it may make sense to remove the first couple from a brewer.pal() selection, leaving only the darkest.
Here’s an example:
green_pal <- brewer.pal(8, "Greens")
green_pal <- green_pal[-(1:2)]
Then just add that object to the wordcloud() function.
wordcloud(chardonnay_freqs$term, chardonnay_freqs$num, max.words = 100, colors = green_pal)
(Source: datacamp)
The command display.brewer.all() will display all the palettes. It is a very handy command.
display.brewer.all()
Let’s try to use the PuOr palette
# Create purple_orange
PuOr <- brewer.pal(10, "PuOr")
purple_orange <- PuOr[-(1:2)]
And now we can create the wordcloud with this palette
wordcloud(coffee_freqs$term, coffee_freqs$num, max.words = 100, colors = purple_orange)
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : coffeelover could not be fit on page. It will not be plotted.
(followed by similar “could not be fit on page” warnings for about twenty other terms, among them “love”, “need”, “wine”, “espresso”, “morning” and “caffeine”)
Sometimes not all the words can be plotted. In this case the only solutions are to reduce the number of words or to reduce the scale of the words themselves. For example
wordcloud(coffee_freqs$term, coffee_freqs$num, max.words = 100, colors = purple_orange, scale = c(2,0.3))
Now all the words are in the plots.
Now, sometimes single words don’t tell the entire story and it is interesting to do the same plot with bigrams (pairs of words that appear together in the corpus). The tokenizer from RWeka is very useful for this.
library(RWeka)
Then we need to get the pairs of words (note that the definition given below will give you only bigrams, and not single words anymore).
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm.bigram <- TermDocumentMatrix(coffee_corpus, control = list(tokenize = BigramTokenizer))
Then we can get the frequencies of the bigrams
freq <- sort(rowSums(as.matrix(tdm.bigram)), decreasing = TRUE)
freq.df <- data.frame(word = names(freq), freq= freq)
head(freq.df)
## word freq
## https //t https //t 1011
## #coffee https #coffee https 139
## the solution the solution 38
## cup of cup of 37
## of #coffee of #coffee 36
## a #coffee a #coffee 34
Now we can plot the wordcloud
wordcloud(freq.df$word, freq.df$freq, max.words = 50, random.order = F, colors = purple_orange, scale = c(4,0.7))
Of course we would first need to clean up the bigram list, but that goes beyond these notes. An important point is that if you remove all stop words, like “not”, you may lose important information for bigrams (such as negations).
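One way around this (a sketch, assuming tm’s English stop word list) is to take the negations out of the stop word list before cleaning:

```r
library(tm)

# Drop the negations from the default stop word list, so that
# bigrams like "not good" survive the cleaning step
my_stops <- setdiff(stopwords("en"), c("no", "nor", "not"))

"not" %in% stopwords("en")  # it is in the default list...
"not" %in% my_stops         # ...but no longer in ours
```

This my_stops vector can then be passed to removeWords() in place of stopwords("en").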
Just as a reference here is the code to do wordclouds with trigrams
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm.trigram <- TermDocumentMatrix(coffee_corpus, control = list(tokenize= TrigramTokenizer))
freq <- sort(rowSums(as.matrix(tdm.trigram)), decreasing = TRUE)
freq.df <- data.frame(word = names(freq), freq= freq)
head(freq.df)
## word freq
## #coffee https //t #coffee https //t 134
## cup of #coffee cup of #coffee 24
## coffee #coffee https coffee #coffee https 21
## drink more coffee drink more coffee 19
## https //t co/pqxa5ev8qg https //t co/pqxa5ev8qg 19
## is quite simple is quite simple 19
To find common words we need to create two “big” documents of tweets. We need to collapse all tweets together separated by a space
all_coffee <- paste(coffee_tweets, collapse = " ")
all_tea <- paste(tea_tweets, collapse = " ")
all_tweets <- c(all_coffee, all_tea)
Now we convert to a Corpus
# Convert to a vector source
all_tweets <- VectorSource(all_tweets)
# Create all_corpus
all_corpus <- VCorpus(all_tweets)
Now that we have a corpus filled with words used in both the tea and coffee tweets, we can clean the corpus, convert it into a TermDocumentMatrix, and then into a matrix to prepare it for a commonality.cloud(). First we need to define a proper cleaning function that also removes the words “coffee” and “tea”.
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "mug", "coffee", "tea"))
  return(corpus)
}
Let’s clean the corpus
# Clean the corpus
all_clean <- clean_corpus (all_corpus)
# Create all_tdm
all_tdm <- TermDocumentMatrix(all_clean)
# Create all_m
all_m <- as.matrix(all_tdm)
Now the commonality cloud
commonality.cloud(all_m, max.words = 100, colors = "steelblue1")
You can plot a comparison cloud in this way
comparison.cloud(all_m, max.words = 50, colors = c("orange", "blue"), scale = c(3,0.5))
(Source Datacamp) A commonality.cloud() may be misleading since words could be represented disproportionately in one corpus or the other, even if they are shared. In the commonality cloud, they would show up without telling you which one of the corpora has more term occurrences.
To solve this problem, we can create a pyramid.plot() from the plotrix package.
library(plotrix)
all_tdm_m <- all_m
# Create common_words
common_words <- subset(all_tdm_m, all_tdm_m[, 1] > 0 & all_tdm_m[, 2] > 0)
# Create difference
difference <- abs(common_words[, 1] - common_words[, 2])
# Combine common_words and difference
common_words <- cbind(common_words, difference)
# Order the data frame from most differences to least
common_words <- common_words[order(common_words[, 3], decreasing = TRUE), ]
# Create top25_df
top25_df <- data.frame(x = common_words[1:25, 1],
y = common_words[1:25, 2],
labels = rownames(common_words[1:25, ]))
# Create the pyramid plot
pyramid.plot(top25_df$x, top25_df$y,
labels = top25_df$labels, gap = 60,
top.labels = c("Coffee", "Words", "Tea"),
main = "Words in Common", laxlab = NULL,
raxlab = NULL, unit = NULL)
## [1] 5.1 4.1 4.1 2.1
In a network graph, the circles are called nodes and represent individual terms, while the lines connecting the circles are called edges and represent the connections between the terms.
For the over-caffeinated text miner, qdap provides a shortcut for making word networks. The word_network_plot() and word_associate() functions both make word networks easy!
word_associate(coffee_tweets, match.string = c("monday"),
stopwords = c(Top200Words, "coffee", "mug"),
network.plot = TRUE)
## Warning in text2color(words = V(g)$label, recode.words = target.words,
## colors = label.colors): length of colors should be 1 more than length of
## recode.words
## row group unit text
## 1 51 all 51 RT @suziday123: @gigirules7 Monday #coffee Morning #letsdothis #MondayMotivaton @CaraMiaSG @MagnumExotics @Coffee_and_Bean https://t
## 2 62 all 62 Alto Astral #bemorehuman #monday #saturday #goodmorning #overheadsquat #snatch #coffee https://t.co/gUp677aGDp
## 3 477 all 477 RT @badolinaLDN: Monday morning... Our #coffee is so strong it wakes up the neighbours! (sorry neighbours....) https://t.co/GjgqbKMqmD
## 4 748 all 748 When Tuesday after a long weekend still feels like Monday #parenting #coffee #CoffeeAddict #preschooler #coffeetime https://t.co/od8zLwjjdb
##
## Match Terms
## ===========
##
## List 1:
## monday, mondaymotivaton
##
Now that you understand the steps in making a dendrogram, you can apply them to text. But first, you have to limit the number of words in your TDM using removeSparseTerms() from tm. Why would you want to adjust the sparsity of the TDM/DTM?
TDMs and DTMs are sparse, meaning they contain mostly zeros. Remember that 1000 tweets can become a TDM with over 3000 terms! You won’t be able to easily interpret a dendrogram that is so cluttered, especially if you are working on more text.
A good TDM has between 25 and 70 terms. The sparse argument is a percentage cutoff on the proportion of zeros allowed for each term: the closer it is to 1, the more terms are kept, while lower values remove more terms.
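To get a feeling for the cutoff, here is a toy sketch (the mini corpus is made up) showing how the sparse value changes the number of surviving terms:

```r
library(tm)

# A tiny made-up corpus: four documents, four distinct terms
docs <- VCorpus(VectorSource(c("coffee morning", "coffee work",
                               "coffee tea", "tea")))
tdm <- TermDocumentMatrix(docs)
dim(tdm)  # 4 terms x 4 documents

# A term survives if it appears in more than (1 - sparse) of the documents.
# sparse = 0.75 keeps terms present in more than 25% of the documents
dim(removeSparseTerms(tdm, sparse = 0.75))  # "coffee" and "tea" survive

# sparse = 0.30 keeps only terms present in more than 70% of the documents
dim(removeSparseTerms(tdm, sparse = 0.30))  # only "coffee" survives
```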
Let’s see the dimensions of our coffee TDM
dim(coffee_tdm)
## [1] 3950 1000
Let’s remove some terms
coffee_tdm1 <- removeSparseTerms(coffee_tdm, sparse = 0.95)
dim(coffee_tdm1)
## [1] 3 1000
Let’s see a dendrogram now
coffee_tdm1_m <- as.matrix(coffee_tdm1)
coffee_tdm1_df <- as.data.frame(coffee_tdm1_m)
coffee_dist <- dist(coffee_tdm1_df)
coffee_hc <- hclust(coffee_dist)
plot(coffee_hc)
Now let’s make the dendrogram more appealing
library(dendextend)
Now we convert the hclust object into a dendrogram and look at its labels
hcd <- as.dendrogram(coffee_hc)
labels(hcd)
## [1] "love" "amp" "day"
Now let’s work on the appearance
hcd <- branches_attr_by_labels(hcd, c("mondaymorning", "work"), "red")
## Warning in branches_attr_by_labels(hcd, c("mondaymorning", "work"), "red"): Not all of the labels you provided are included in the dendrogram.
## The following labels were omitted:mondaymorningwork
plot(hcd, main = "Better Dendrogram")
Now let’s add rectangular shapes around the clusters
# Add cluster rectangles
plot(hcd, main = "Better Dendrogram")
rect.dendrogram(hcd, k = 2, border = "grey50")
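The rectangles only visualize the cut; if you want the actual cluster membership you can use cutree(). A minimal sketch on toy data (the term vectors below are made up for illustration):

```r
# Three made-up term vectors, just to have something to cluster
toy <- data.frame(love = c(1, 0, 2), amp = c(0, 1, 1), day = c(2, 1, 0))
hc  <- hclust(dist(toy))

# cutree() assigns each observation to one of k clusters,
# matching the k rectangles drawn by rect.dendrogram()
cutree(hc, k = 2)
```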
Another way to think about word relationships is with the findAssocs() function in the tm package. For any given word, findAssocs() calculates its correlation with every other word in a TDM or DTM. Scores range from 0 to 1. A score of 1 means that two words always appear together, while a score of 0 means that they never appear together.
To use findAssocs() pass in a TDM or DTM, the search term, and a minimum correlation. The function will return a list of all other terms that meet or exceed the minimum threshold.
findAssocs(tdm, "word", 0.25)
# Create associations
associations <- findAssocs(coffee_tdm, "mug", 0.2)
# View the mug associations
print(associations)
## $mug
## ceramic fathers deserves
## 0.42 0.37 0.34
## epiconetsy etsychaching httpstcoiwovyllqjz
## 0.34 0.34 0.34
## birthday daydrinking disappearing
## 0.32 0.32 0.32
## hockey httpstcocgzitfizv httpstcohcqgsyyv
## 0.32 0.32 0.32
## httpstcoqzukkwrhnp morningkraze morphing
## 0.32 0.32 0.32
## silicone whiskeyandwhineco batman
## 0.32 0.32 0.28
## june away battery
## 0.27 0.22 0.22
## coffeelovers color designs
## 0.22 0.22 0.22
## etsy funny retro
## 0.22 0.22 0.22
## sensitive travel blue
## 0.22 0.22 0.21
## creative kitchenhandle
## 0.21 0.21
library(ggplot2)
library(ggthemes)
# Create associations_df (list_vect2df() comes from qdap)
associations_df <- list_vect2df(associations)[, 2:3]
# Plot the associations_df values (don't change this)
ggplot(associations_df, aes(y = associations_df[, 1])) +
  geom_point(aes(x = associations_df[, 2]),
             data = associations_df, size = 3) +
  theme_gdocs()
Another interesting distance measure is the cosine distance, provided by the proxy package.
library(proxy)
coffee_tdm_m <- as.matrix(coffee_tdm)
coffee_cosine_dist_mat <- as.matrix(dist(coffee_tdm_m, method = "cosine"))
What dimensions does this matrix have?
dim(coffee_cosine_dist_mat)
## [1] 3950 3950
As expected. Let’s check a few entries
coffee_cosine_dist_mat[1:5,1:5]
## abroad abwbhlucas accessory account acneskinsite
## abroad 0 1 1 1 1
## abwbhlucas 1 0 1 1 1
## accessory 1 1 0 1 1
## account 1 1 1 0 1
## acneskinsite 1 1 1 1 0
We can do the same computation much more efficiently by exploiting the sparse matrix representation. Note that this formula actually gives the cosine similarity between documents (1 on the diagonal), not the distance:
library(slam)
cosine_dist_mat <- crossprod_simple_triplet_matrix(coffee_tdm) /
  (sqrt(col_sums(coffee_tdm^2) %*% t(col_sums(coffee_tdm^2))))
cosine_dist_mat[1:5,1:5]
## Docs
## Docs 1 2 3 4 5
## 1 1 0 0 0.0 0.0
## 2 0 1 0 0.0 0.0
## 3 0 0 1 0.0 0.0
## 4 0 0 0 1.0 0.9
## 5 0 0 0 0.9 1.0
Tweets 4 and 5 have a similarity score of 0.9, a very high one: typically this means one is a retweet of the other. Tweets 2 and 3, on the other hand, have a similarity of 0. Let’s check them
print(coffee_tweets[[2]])
## [1] "My #coffee and the #rhythm. #lovemyjob #lovemylife https://t.co/cOMZCPYaOM"
print(coffee_tweets[[3]])
## [1] "Tuesday Tunes: Dan Fogleberg https://t.co/h8rG74RKVj #coffee #TuesdayTunes #Fogleberg"
They are indeed completely different tweets, with no words in common.
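As a quick sanity check (on a tiny made-up matrix, not the real TDM), we can verify that the crossprod_simple_triplet_matrix() formula really computes the cosine similarity between documents:

```r
library(slam)

# Tiny term-document matrix: 3 terms (rows) x 3 documents (columns)
m   <- matrix(c(1, 0, 2,
                0, 1, 1,
                1, 1, 0), nrow = 3, byrow = TRUE)
stm <- as.simple_triplet_matrix(m)

sim <- crossprod_simple_triplet_matrix(stm) /
  (sqrt(col_sums(stm^2) %*% t(col_sums(stm^2))))

# Cosine of documents 1 and 2, computed the pedestrian way
cos12 <- sum(m[, 1] * m[, 2]) / (sqrt(sum(m[, 1]^2)) * sqrt(sum(m[, 2]^2)))
all.equal(as.numeric(sim[1, 2]), cos12)  # TRUE, and diag(sim) is all 1
```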
Let’s now rebuild the matrices using tf-idf weighting instead of raw term frequencies:
my.tdm <- TermDocumentMatrix(coffee_corpus, control = list(weighting = weightTfIdf))
my.dtm <- DocumentTermMatrix(coffee_corpus, control = list(weighting = weightTfIdf, stopwords = TRUE))
inspect(my.dtm)
## <<DocumentTermMatrix (documents: 1000, terms: 4678)>>
## Non-/sparse entries: 10650/4667350
## Sparsity : 100%
## Maximal term length: 73
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample :
## Terms
## Docs & #coffee #coffeelover #wine can coffee cup drink need via
## 10 0 0.00000000 0 0 0 0 0 0 0 0
## 315 0 0.00000000 0 0 0 0 0 0 0 0
## 509 0 0.00000000 0 0 0 0 0 0 0 0
## 560 0 0.00000000 0 0 0 0 0 0 0 0
## 721 0 0.00000000 0 0 0 0 0 0 0 0
## 728 0 0.00000000 0 0 0 0 0 0 0 0
## 78 0 0.00000000 0 0 0 0 0 0 0 0
## 800 0 0.01293925 0 0 0 0 0 0 0 0
## 807 0 0.00000000 0 0 0 0 0 0 0 0
## 82 0 0.00000000 0 0 0 0 0 0 0 0
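To make the tf-idf weights above less mysterious, here is a toy two-document corpus (invented for illustration). tm computes the normalized term frequency times log2(N/df), so a term like coffee that appears in every document gets weight 0:

```r
library(tm)

toy_corpus <- VCorpus(VectorSource(c("coffee coffee mug",
                                     "coffee tea")))
toy_dtm <- DocumentTermMatrix(toy_corpus,
                              control = list(weighting = weightTfIdf))

# "coffee" is in both documents, so idf = log2(2/2) = 0 and its weight vanishes;
# "mug" gets (1/3) * log2(2/1) = 1/3, "tea" gets (1/2) * log2(2/1) = 1/2
as.matrix(toy_dtm)
```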
Let’s look (for example) for terms with a total weight of at least 200. With normalized tf-idf weights no term gets anywhere near such a value, so nothing is returned (findFreqTerms() is really meant for raw term-frequency matrices):
findFreqTerms(my.tdm, 200)
## character(0)
cosine_dist_mat <- crossprod_simple_triplet_matrix(my.tdm) /
  (sqrt(col_sums(my.tdm^2) %*% t(col_sums(my.tdm^2))))
cosine_dist_mat[1:5,1:5]
## Docs
## Docs 1 2 3 4 5
## 1 1.000000e+00 5.174231e-05 4.111195e-05 3.638475e-05 3.638475e-05
## 2 5.174231e-05 1.000000e+00 6.381586e-05 5.647809e-05 5.647809e-05
## 3 4.111195e-05 6.381586e-05 1.000000e+00 4.487477e-05 4.487477e-05
## 4 3.638475e-05 5.647809e-05 4.487477e-05 1.000000e+00 8.798005e-01
## 5 3.638475e-05 5.647809e-05 4.487477e-05 8.798005e-01 1.000000e+00
Let’s find all pairs of documents with a similarity above 0.5:
y <- which(cosine_dist_mat > 0.5, arr.ind = TRUE)
str(y)
## int [1:2744, 1:2] 1 2 3 4 5 6 4 5 6 4 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:2744] "1" "2" "3" "4" ...
## ..$ : chr [1:2] "Docs" "Docs"
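A small self-contained illustration of what which(..., arr.ind = TRUE) returns (on a toy matrix, not the real one): instead of linear indices you get (row, column) pairs, which is what makes the lookup below possible.

```r
m <- matrix(c(1.0, 0.6, 0.2,
              0.6, 1.0, 0.9,
              0.2, 0.9, 1.0), nrow = 3, byrow = TRUE)

# Only the upper triangle, to skip the diagonal and mirrored duplicates
which(m > 0.5 & upper.tri(m), arr.ind = TRUE)
#      row col
# [1,]   1   2
# [2,]   2   3
```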
y
## Docs Docs
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 4
## 5 5 4
## 6 6 4
## 4 4 5
## 5 5 5
## 6 6 5
## 4 4 6
## 5 5 6
## 6 6 6
## 7 7 7
## 81 81 7
## 152 152 7
## 8 8 8
## 9 9 9
## 10 10 10
## 11 11 11
## 12 12 12
## 27 27 12
## 50 50 12
## 97 97 12
## 102 102 12
## 113 113 12
## 115 115 12
## 132 132 12
## 217 217 12
## 335 335 12
## ... (hundreds of similar rows omitted for brevity) ...
## [ reached getOption("max.print") -- omitted 2244 rows ]
We can inspect individual tweets by index:
print(coffee_tweets[[209]])
## [1] "RT @WiltsArtisans: #Glutenfree #Plum & #cinnamon #slice perfect with #coffee https://t.co/bFIEWFs611"
print(coffee_tweets[[202]])
## [1] "Always getting fresh roasts out to clients within 24 hours. Portland metro area gets delivery next day by me! https://t.co/wQZU2Pos0C"
And we can extract the corresponding similarity values from the matrix with
cosine_dist_mat[y]
## [1] 1.0000000 1.0000000 1.0000000 1.0000000 0.8798005 0.8798005
## [7] 0.8798005 1.0000000 0.8798005 0.8798005 0.8798005 1.0000000
## [13] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [19] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [25] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [31] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [37] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## ... (further values omitted for brevity) ...
## [ reached getOption("max.print") -- omitted 1744 entries ]
Finally, let’s try to cluster the tweets with k-means. We build a plain document-term matrix and then apply tf-idf weighting:
dtm <- DocumentTermMatrix(coffee_corpus)
dtm_tfxidf <- weightTfIdf(dtm)
inspect(dtm_tfxidf[1:10, 1001:1010])
## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 1/99
## Sparsity : 99%
## Maximal term length: 15
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample :
## Terms
## Docs #handmade #happiness, #happy #happytuesday #happywednesday
## 1 0 0 0 0 0
## 10 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## 7 0 0 0 0 0
## 8 0 0 0 0 0
## 9 0 0 0 0 0
## Terms
## Docs #harleyquinn #hawaii #hazelnuts #headford #health
## 1 0 0 0 0 0.0000000
## 10 0 0 0 0 0.0000000
## 2 0 0 0 0 0.0000000
## 3 0 0 0 0 0.0000000
## 4 0 0 0 0 0.0000000
## 5 0 0 0 0 0.0000000
## 6 0 0 0 0 0.0000000
## 7 0 0 0 0 0.0000000
## 8 0 0 0 0 0.4647395
## 9 0 0 0 0 0.0000000
m <- as.matrix(dtm_tfxidf)
rownames(m) <- 1:nrow(m)
### don't forget to normalize the vectors so Euclidean makes sense
norm_eucl <- function(m) m/apply(m, MARGIN=1, FUN=function(x) sum(x^2)^.5)
m_norm <- norm_eucl(m)
### cluster into 10 clusters
cl <- kmeans(m_norm, 10)
table(cl$cluster)
##
## 1 2 3 4 5 6 7 8 9 10
## 13 15 883 7 9 14 32 10 9 8
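One caveat worth remembering: kmeans() starts from random centers, so the cluster sizes in the table above will change from run to run. A small sketch (on toy data, invented here) of making the clustering reproducible and more stable:

```r
set.seed(42)                       # fix the random starting centers
toy <- matrix(rnorm(100), nrow = 20)

# nstart runs k-means from several random starts and keeps the best solution
cl <- kmeans(toy, centers = 3, nstart = 25)
cl$size                            # now identical across runs
```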
Let’s look at the documents that ended up in cluster 1:
dtm[cl$cluster == 1, ]
## <<DocumentTermMatrix (documents: 13, terms: 4784)>>
## Non-/sparse entries: 177/62015
## Sparsity : 100%
## Maximal term length: 73
## Weighting : term frequency (tf)
Let’s list the terms that appear at least once in the documents of cluster 7:
findFreqTerms(dtm[cl$cluster == 7, ], 1)
## [1] "@elfortney:" "@winewankers:"
## [3] "#coffee" "#coffeelover"
## [5] "#wine" "#winelover"
## [7] "coffee." "drink"
## [9] "https://t.co/mz8mg4fi9b" "https://t.co/pqxa5ev8qg"
## [11] "more" "prayer!"
## [13] "problem" "quite"
## [15] "simple." "solution"
## [17] "solution:" "sometimes"
## [19] "the" "your"
Let’s inspect the content of cluster 7:
inspect(coffee_corpus[which(cl$cluster == 7)])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 32
##
## $`RT @winewankers: The prayer! #coffee #coffeelover #wine #winelover https://t.co/Mz8mG4FI9b`
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 90
##
## $`RT @winewankers: The prayer! #coffee #coffeelover #wine #winelover https://t.co/Mz8mG4FI9b`
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 90
##
## $`RT @elfortney: Sometimes the solution to your problem is quite simple. The solution: drink more coffee. #Coffee https://t.co/Pqxa5Ev8qG`
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 135
##
## $`Sometimes the solution to your problem is quite simple. The solution: drink more coffee. #Coffee https://t.co/Pqxa5Ev8qG`
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 120
##
## ... (the remaining documents in the cluster are further copies of the same two retweets; output omitted for brevity)
So cluster 7 consists almost entirely of two heavily retweeted messages.